Challenge: What is the future selling price of a Home?

A home is often the largest and most expensive purchase a person makes in his or her lifetime. Ensuring homeowners have a trusted way to monitor this asset is incredibly important.

In this competition, students are required to develop a full-fledged approach to make predictions about the future sale prices of homes. A full-fledged approach constist, at least, in the following steps:

  • Descriptive statistics about the data
  • Data cleaning and pre-processing
  • Defining a modeling approach to the problem
  • Build such a statistical model
  • Validate the outcome of the model

Now, should you ask a home buyer to describe their dream house, they probably wouldn't begin with describing features such as the height of the basement ceiling or the proximity to a railroad. As you will see, the dataset we use in this competition proves that many more features influence price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in a small city in the US, this competition challenges you to predict the final price of each home.


The plan of attack

Here the summary of what we are going to do through the notebook

  • Load the data: dataset and testset
  • Load the features metadata (docs): meaning, description and constraints of the features
  • NaN values: how to manage/recover them
  • Fit the standard: recover/delete records which are not compliant with the integrity rules
  • Remap quantitative features: convert categorical features into dummy ones
  • Data Statistics: statistical description of the dataset and linear regression assumptions
  • Data Manipulation: data cleaning and pre-processing
  • Build the model: define the strategy to follow and build the final model
  • SalePrice prediction: evaluate the predictions and store them as CSV file
  • Final consideration: some final coments to what have been and have not been done through the notebook


Load the data

The first thing to do is to retrieve the raw data from disk: we have a dataset (that we will use to define and train our model) and a test set (on which we will make the model run, as required for the challenge).

In [ ]:
# missing python libraries
!pip install seaborn
!pip install statsmodels
In [2]:
import numpy as np
import pandas as pd
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import re
import pprint
import json

from typing import *
In [3]:
# load the dataset
dataset = pd.read_csv('challenge_data/train.csv', sep=',', keep_default_na=True, header='infer' )
# load the test set
testset = pd.read_csv('challenge_data/test.csv', sep=',', keep_default_na=True, header='infer' )
# save the Id column for the final result
testset_Ids = testset['Id']
In [4]:
dataset.shape
Out[4]:
(1200, 81)
In [5]:
testset.shape
Out[5]:
(260, 80)

Let's have a look...

After a first look at the table, we can notice that there are some problems with some features where 'NA' is really a 'Not Available' entry. Pandas treats them as NaN values: we need to reconvert these values where required.

In [6]:
dataset.iloc[:10]
Out[6]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000
5 6 50 RL 85.0 14115 Pave NaN IR1 Lvl AllPub ... 0 NaN MnPrv Shed 700 10 2009 WD Normal 143000
6 7 20 RL 75.0 10084 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 8 2007 WD Normal 307000
7 8 60 RL NaN 10382 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN Shed 350 11 2009 WD Normal 200000
8 9 50 RM 51.0 6120 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 4 2008 WD Abnorml 129900
9 10 190 RL 50.0 7420 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 1 2008 WD Normal 118000

10 rows × 81 columns


Features metadata

The provided text file called "Data Description.rtf" is a sort of reference documentation for the features we have inside the dataset. As we can read, available features are both qualitative (Categorical data) and quantitative (Continuous data): they have a short description, we can use to understand what the feature actually is and what are its possible values (if qualitative).

The next step is to convert this documentation in a JSON file: it could be used to define integrity rules over the data and it could be more easily accessed by us if needed (feature lookup).

The .RTF file

The first 20 lines of the rtf file look like this:

In [7]:
! head -n 20 ./challenge_data/Data\ description.rtf
{\rtf1\ansi\ansicpg1252\cocoartf1561\cocoasubrtf200
{\fonttbl\f0\fmodern\fcharset0 Courier;}
{\colortbl;\red255\green255\blue255;\red0\green0\blue0;}
{\*\expandedcolortbl;;\cssrgb\c0\c0\c0;}
\paperw11900\paperh16840\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\deftab720
\pard\pardeftab720\sl280\partightenfactor0

\f0\fs24 \cf2 \expnd0\expndtw0\kerning0
\outl0\strokewidth0 \strokec2 

MSSubClass: Identifies the type of dwelling involved in the sale.	\
\
        20	1-STORY 1946 & NEWER ALL STYLES\
        30	1-STORY 1945 & OLDER\
        40	1-STORY W/FINISHED ATTIC ALL AGES\
        45	1-1/2 STORY - UNFINISHED ALL AGES\
        50	1-1/2 STORY FINISHED ALL AGES\
        60	2-STORY 1946 & NEWER\
        70	2-STORY 1945 & OLDER\

The idea is to remove the first 10 lines which are not useful and focus on the feature description only. We need also to remove the last line which looks likes this:

In [8]:
! tail -n 1 ./challenge_data/Data\ description.rtf
Ch}

Using bash commands and regular expression we can easily remove the lines and the '\' characters.
In addition, we want to add a character '#' to the beginning of each feature name which will be used to split the text.

In [9]:
! cat ./challenge_data/Data\ description.rtf | tail -n +12  | \
head -n -1 | sed -e 's,\\,,g' | sed -e 's,^\(\w*\:\),#\1,g' > ./challenge_data/feature_descriptions.txt
In [10]:
! head -n 10 ./challenge_data/feature_descriptions.txt
#MSSubClass: Identifies the type of dwelling involved in the sale.	

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES

We can now create a Python dictionary from this file and then convert it into the JSON format (exploiting the json library). The dictionary will contain a key for each feature, while each value will be another dictionary with the following keys:

  • descriprion: a string describing the feature
  • continuous: a boolean to state if the feature is quantitative or not
  • values: a dictionary containing, in case of qualitative features, the possible values the feature may present (as keys) and the mapping into a number (as values: could be useful for the future data preparation)
In [11]:
# read the RTF file
f = open('./challenge_data/feature_descriptions.txt', 'r')
text = f.read()
f.close()
# split the file in order to have a list of strings, each one containing a feature except the first element which is empty
features_text = text.split("#")[1:]
In [12]:
# sample
print(features_text[5])
Alley: Type of alley access to property

       Grvl	Gravel
       Pave	Paved
       NA 	No alley access
		

In [13]:
features_metadata = {}

for feature_raw in features_text:
    # clean the text from some special token combination
    feature = (feature_raw
                        .replace('\t\t', '') 
                        .replace('\n\t\n', '\n\n') 
                        .replace('\n\t\t\n', '\n\n')
                        .replace('\n \n', '\n\n')
              )

    # split between the feature name and description and its values (if any)
    try:    
        tmp = feature.split('\n\n')
        feature_name = tmp[0]
        feature_values = tmp[1]
    except ValueError:
        print(indx)
    except IndexError:
        print(indx)
    # separate the name from the descritpion
    key, descr_raw = feature_name.split(':')
    # clean the descritpion leading white space
    description = re.sub('^ ', '', descr_raw)

    # populate the dictionary including the new feature and its description
    features_metadata[key] = {
        'description' : description,
    }

    # control if the feature is discrete or not and populate the entry
    if feature_values != '':
        # include the info in the dictionary
        features_metadata[key]['continuous'] = False
        features_metadata[key]['values'] = {}

        entries_cleaned = feature_values.replace('\t\n', '\n')

        # split the values (one per line)
        entries_split = [value for value in entries_cleaned.split('\n')]
        # filter the entries composed of a string of white spaces only
        entries_raw = [x for x in filter(lambda x: re.sub(' +', '', x), entries_split)]
        # split the value name from the descritpion
        entries_raw = [value_descr.split('\t') for value_descr in entries_raw]
        # filter the entries composed of a string of white spaces only
        entries = [ [x for x in filter(lambda x: re.sub(' +', '', x), entry_raw)] for entry_raw in entries_raw ]

        for indx, entry in enumerate(entries):
            # get the value and clean from whitespaces
            value = re.sub(' +', '', entry[0])
            # prepare the descritpion
            try:
                description = entry[1]
            except IndexError:
                print(indx)
            # create the entry with the feature value as key and include the descritpion
            features_metadata[key]['values'][value] = {
                'description': description,
                # we include also a numerical id
                'numerical_id': indx
            }
    else:
        features_metadata[key]['continuous'] = True   

To better visualize the data structure, let's see the result considering one feature:

In [14]:
pprint.pprint(features_metadata['Condition1'], depth=3)
{'continuous': False,
 'description': 'Proximity to various conditions',
 'values': {'Artery': {'description': 'Adjacent to arterial street',
                       'numerical_id': 0},
            'Feedr': {'description': 'Adjacent to feeder street',
                      'numerical_id': 1},
            'Norm': {'description': 'Normal', 'numerical_id': 2},
            'PosA': {'description': 'Adjacent to postive off-site feature',
                     'numerical_id': 6},
            'PosN': {'description': 'Near positive off-site feature--park, '
                                    'greenbelt, etc.',
                     'numerical_id': 5},
            'RRAe': {'description': 'Adjacent to East-West Railroad',
                     'numerical_id': 8},
            'RRAn': {'description': 'Adjacent to North-South Railroad',
                     'numerical_id': 4},
            'RRNe': {'description': "Within 200' of East-West Railroad",
                     'numerical_id': 7},
            'RRNn': {'description': "Within 200' of North-South Railroad",
                     'numerical_id': 3}}}

Feature names integrity

...Are we sure that the dataset's features have the same names specified in our documentation? Let us check it

In [15]:
# some features do not appear inside the documentation
no_docs_features = ['Id', 'SalePrice']

# check the number of available features
dataset_features = set(dataset.drop(no_docs_features, axis=1).columns.values)
docs_features = set(features_metadata.keys())
assert len(dataset_features) - len(docs_features) == 0, "Not equal length between the two sets"
In [16]:
# see which features are in the Dataset but not in the Metadata file
dataset_features - docs_features
Out[16]:
{'BedroomAbvGr', 'KitchenAbvGr'}
In [17]:
# see which features are in the Metadata file but not in the Dataset
docs_features - dataset_features
Out[17]:
{'Bedroom', 'Kitchen'}

The answer is no: in fact, we can see two discrepances in the names of this two features.

It looks like it is just an incongruence for the names (something like a typo): to be sure, let's see if the dataset dataset_features match the ones described inside the documentation (or, if quantitative, could be fine with the feature description and its nature)

In [18]:
# BedroomAbvGr and Bedroom
print("Unique values for 'BedroomAbvGr': {!s}".format(dataset['BedroomAbvGr'].unique()))
print("Description for 'Bedroom': \n {}".format(features_metadata['Bedroom']['description']))
Unique values for 'BedroomAbvGr': [3 4 1 2 0 5 6 8]
Description for 'Bedroom': 
 Bedrooms above grade (does NOT include basement bedrooms)
In [19]:
# KitchenAbvGr and Kitchen
print("Unique values for 'KitchenAbvGr': {!s}".format(dataset['KitchenAbvGr'].unique()))
print("Description for 'Kitchen': \n {}".format(features_metadata['Kitchen']['description']))
Unique values for 'KitchenAbvGr': [1 2 3 0]
Description for 'Kitchen': 
 Kitchens above grade

As we can see, the metadata description is very similar to the dataset's feature names. We can then correct the JSON Metadata file (our documentation) in order to make it compliant with the Dataset

In [20]:
# correct the documentation
features_metadata['BedroomAbvGr'] = features_metadata.pop('Bedroom')
features_metadata['KitchenAbvGr'] = features_metadata.pop('Kitchen')

The JSON file

Now, we can save on disk the documentatin in JSON format.

In [21]:
# save the JSON docs file
f = open('./challenge_data/feature_descritpion.json', 'w')
f.write(json.dumps(features_metadata))
f.close()

Feature metadata lookup

Given a feature name, we want a function able to to present the docs information about that feature in a more user friendly way

In [22]:
def PrintFeatureMetadata(feature: str, drop_id_col=True):
    """
    Function to properly print all the metadata about the passed feature name.
    It prints a dataframe with the values associated with that feature if discrate, 
    otherwise it doesn't print the values.
    
    Params:
        - feature: the name of the feature to be printed
        - drop_id_col: keep or not the 'numerical_id' column when printing
    """
    
    print("Feature name: {0!s}".format(feature))
    print("--- Continuous: {1!s}".format("", features_metadata[feature]['continuous']))
    print("--- Description: {1!s}".format("", features_metadata[feature]['description']))
    print('\n')
    
    if not features_metadata[feature]['continuous']:
        df = pd.DataFrame(data=features_metadata[feature]['values']).T
        df['values'] = df.index
        # select if visualize the numerical id or not
        if drop_id_col:
            col = ['values', 'description']
        else:
            col = ['values', 'numerical_id', 'description']
        print(
            df[col]
            .reset_index(drop=True)
            .sort_values(['values'], ascending=[1])
        )
        print('\n')
In [23]:
# an example
PrintFeatureMetadata('Condition1')
Feature name: Condition1
--- Continuous: False
--- Description: Proximity to various conditions


   values                                        description
0  Artery                        Adjacent to arterial street
1   Feedr                          Adjacent to feeder street
2    Norm                                             Normal
3    PosA               Adjacent to postive off-site feature
4    PosN  Near positive off-site feature--park, greenbel...
5    RRAe                     Adjacent to East-West Railroad
6    RRAn                   Adjacent to North-South Railroad
7    RRNe                  Within 200' of East-West Railroad
8    RRNn                Within 200' of North-South Railroad



Manage the NaNs

As we have seen before, Pandas automatically converts "NA" values as "NaN" ones (Not a Number). The problem is that there are actually some fields (categorical string features) that require to maintain "NA" as value.

In order to spot these fields and fix the problem, we exploit the created JSON metadata file

In [24]:
# create the dictionary whose keys are the fields that need to maintain the "NA" value
NaN_dict = {}
for k in features_metadata.keys():
    if not features_metadata[k]['continuous'] and 'NA' in features_metadata[k]['values']:
        NaN_dict[k] = 'NA'

# convert NaN as 'NA'
dataset = dataset.fillna(NaN_dict)
testset = testset.fillna(NaN_dict)

The dataset

Let us control the remaining dataset features who present NaN values:

In [25]:
NaN_dataset = dataset.isnull().sum().sort_values(ascending=False)[lambda x: x > 0]
for x in NaN_dataset.index:
    PrintFeatureMetadata(x)
Feature name: LotFrontage
--- Continuous: True
--- Description: Linear feet of street connected to property


Feature name: GarageYrBlt
--- Continuous: True
--- Description: Year garage was built


Feature name: MasVnrType
--- Continuous: False
--- Description: Masonry veneer type


    values   description
0   BrkCmn  Brick Common
1  BrkFace    Brick Face
2   CBlock  Cinder Block
3     None          None
4    Stone         Stone


Feature name: MasVnrArea
--- Continuous: True
--- Description: Masonry veneer area in square feet


In [26]:
sns.barplot(x=NaN_dataset.index, y=NaN_dataset)
_ = plt.xticks(rotation='45')

Our goal is to recover these "NaN" values.

According to the dataset documentation, we can decide to set them as:

  • "None", for the field "MasVnrType"
  • 0, for the fields "GarageYrBlt", "LotFrontage" and "MasVnrArea"

We will also create the boolean "missing_garage" field, in order to make the model understand when the apartment has no garage

In [27]:
# create the "missing_garage" field (which is not documented)
no_docs_features.append('missing_garage')
dataset['missing_garage'] = dataset['GarageYrBlt'].isna()

# recover NaN values
dataset = dataset.fillna(
    {
        'MasVnrType': 'None',
        'GarageYrBlt': 0,
        'LotFrontage': 0,
        'MasVnrArea': 0
    }
)

The test set

We now replicate what we have done before for the test set

In [28]:
NaN_testset = testset.isnull().sum().sort_values(ascending=False)[lambda x: x > 0]
for x in NaN_testset.index:
    PrintFeatureMetadata(x)
Feature name: LotFrontage
--- Continuous: True
--- Description: Linear feet of street connected to property


Feature name: GarageYrBlt
--- Continuous: True
--- Description: Year garage was built


Feature name: MasVnrType
--- Continuous: False
--- Description: Masonry veneer type


    values   description
0   BrkCmn  Brick Common
1  BrkFace    Brick Face
2   CBlock  Cinder Block
3     None          None
4    Stone         Stone


Feature name: MasVnrArea
--- Continuous: True
--- Description: Masonry veneer area in square feet


Feature name: Electrical
--- Continuous: False
--- Description: Electrical system


  values                                        description
0  FuseA  Fuse Box over 60 AMP and all Romex wiring (Ave...
1  FuseF     60 AMP Fuse Box and mostly Romex wiring (Fair)
2  FuseP  60 AMP Fuse Box and mostly knob & tube wiring ...
3    Mix                                              Mixed
4  SBrkr                  Standard Circuit Breakers & Romex


These fields are the same spotted for the dataset: but we have also "Electrical" now

In [29]:
sns.barplot(x=NaN_testset.index, y=NaN_testset)
_ = plt.xticks(rotation='45')
In [30]:
# create the "missing_garage" field
testset['missing_garage'] = testset['GarageYrBlt'].isna()
# recover NaN values
testset = testset.fillna(
    {
        'MasVnrType': 'None',        
        'GarageYrBlt': 0,
        'LotFrontage': 0,
        'MasVnrArea': 0
    }
)


Remapping the features

In order to create better models, categorical features need to be remapped as dummy ones (or ratings, when possible). In other words, each possible value for a given categorical feature will generate a new feature (and the original one will be dropped)

A class to describe data integrity rule

The class we are going to define acts as a dataset manager: given a dataset (Pandas Dataframe) and eventually its documentation (Python dictionary retrieved from a JSON file), we can define extra integrity rules (Python dictionary) for the data of each feature and spot records who present errors.

In this way, incongruencies can be spotted and recovered: unrecoverable errors can be deleted instead. This class wants to be a reusable and scalable solution, that can be adopted for datasets which present the same nature (in this way, integrity rules are defined only once), automatizing some of the data structure pre-processing operations.

In [31]:
class DataRules(object):
    '''
    This class defines a set of rules that can be used to check the integrity of the columns (features) of a dataset
    '''
    def __init__(self, dataset: pd.core.frame.DataFrame, key_field: str, json_doc: dict):
        self._rules = {}
        self._dataset = dataset
        self._key_field = key_field
        self._json_doc = json_doc
        
    def set_dataset(self, dataset: pd.core.frame.DataFrame, json_doc=None):
        '''
        Assign a specific dataset (and eventually a different documentation) to the dictionary of rules
        '''
        self._dataset = dataset
        if json_doc is not None:
            self._json_doc = json_doc
        
    def rules(self):
        ''' Retrieve the dictionary of rules '''
        return self._rules
    
    def data(self):
        ''' Retrieve the dataset attached to the rules '''
        return self._dataset
    
    def docs(self):
        ''' Retrieve the dataset documentation '''
        return self._json_doc
    
    def add_rule(self, for_columns: Iterable, rule: Callable):
        '''
        Associate a rule to one or more columns, whose names are passed as a list
        '''
        for col in for_columns:
            if col in self._rules:
                # some rules are already existing for this column: add this one too
                self._rules[col].append(rule)
            else:
                # no rules have been already defined for this column
                self._rules[col] = [rule]
    
    def clear_rules(self):
        '''
        Delete all the rules
        '''
        self._rules = {}
    
    def clear_rules_for(self, column_names: Iterable):
        '''
        Delete all the rules associated to the columns passed as parameter
        '''
        for col in column_names:
            if col in self._rules:
                self._rules.pop(col)
    
    def destroy(self):
        '''
        Destroy this object and free the memory
        '''
        # Python does not allow you to explicitly free the memory...
        # But what is not referenced anymore will be "deleted" automatically by the garbage collector
        self._rules = {}
        self._dataset = None
        self._key_field = None
        self._json_doc = None
    
    def spot_wrong_records(self, verbose=False):
        '''
        Print records of the dataset that do not satisfy the rules. 
        For simplicity, only the key and the field with the wrong value are printed.
        '''
        ok = True
        # consider every field of the dataframe
        for col in self._dataset: #(set(self._rules.keys()) - set(['SalesPrice'])):
            # some columns have no documentation: we need to skip them
            if col in ['SalePrice', 'Id', 'missing_garage']:
                continue
            
            log = ''
            anomalies = list()
            # apply user-defined integrity rules, if present
            if col in self._rules:
                for rule in self._rules[col]:
                    # retrieve non-compliant records
                    res = self._dataset.filter(items=[self._key_field, col])[rule(self._dataset[col]) == False]
                    if not res.empty:
                        # update result information
                        log += str(res) + '\n'
                        anomalies.append(res[col])
            # extra check for the categorical fields
            if 'values' in self._json_doc[col]:
                # some fields are categorical: we need to check if their values are compliant with the documentation
                possible_values = set(self._json_doc[col]['values'])
                current_values = set(map(str, self._dataset[col].unique()))   # JSON keys are always strings
                # if there are values which are not described inside the documentation, they must be wrong
                irregular_values = current_values - possible_values
                if len(irregular_values) > 0:
                    # the NaN value has to be properly managed
                    if 'nan' in irregular_values:
                        irregular_values.add(np.NaN)
                        irregular_values.remove('nan')
                    # retrieve irregular records
                    res = self._dataset.filter(items=[self._key_field, col])[self._dataset[col].isin(irregular_values)]
                    log += str(res) + '\n'
                    anomalies.append(res[col])
                    
            # final results
            anomaly_found = log != ''
            if anomaly_found:
                ok = False
                anomalies = sorted(pd.concat(anomalies).unique())
            if verbose:
                print('\n*** Analysing: "{}" ***'.format(col))
                if anomaly_found:
                    print('Problems found!')
                    print('Critical values: {}'.format(anomalies))
                    print(log)
                else:
                    print('No problems found.')
            else:
                if anomaly_found:
                    print('Field: {}'.format(col))
                    print('Wrong values: {}\n'.format(anomalies))
            # if everything was ok, print a message
        if ok:
            print('Everything all right!')
        
    
    def clean_dataset(self):
        '''
        Filter records of the dataset that do not satisfy the rules
        '''
        original_size = len(self._dataset)
        # consider every field of the dataframe
        for col in self._dataset: #(set(self._rules.keys()) - set(['SalesPrice'])):
            if col in ['SalePrice', 'Id', 'missing_garage']:
                continue
            
            # check user-defined integrity rules, if present
            if col in self._rules:
                for rule in self._rules[col]:
                    # keep only regular records
                    self._dataset = self._dataset[rule(self._dataset[col])]
            # extra check for the categorical fields
            if 'values' in self._json_doc[col]:
                # some fields are categorical: we need to check if their values are compliant with the documentation
                possible_values = set(self._json_doc[col]['values'])
                current_values = set(map(str, self._dataset[col].unique()))   # JSON keys are always strings
                # if there are values which are not described inside the documentation, they must be wrong
                irregular_values = current_values - possible_values
                if len(irregular_values) > 0:
                    # the NaN value has to be properly managed
                    if 'nan' in irregular_values:
                        irregular_values.add(np.NaN)
                        irregular_values.remove('nan')
                    # delete records with wrong values for the given field
                    self._dataset = self._dataset[~self._dataset[col].isin(irregular_values)]
        # all right
        resulting_size = len(self._dataset)
        print('Cleaning completed: {} records have been deleted'.format(original_size - resulting_size))
        return self._dataset

Dataset wrong records

First of all, we instantiate the DataRules class: for the moment, we are going to consider the dataset and we are going to define integrity rules for the features

In [32]:
# create the dictionary of rules
apartmentRules = DataRules(dataset, 'Id', features_metadata)

Because several features share the same integrity rules, they are divided into logical families and processed together

In [33]:
# define features integrity rules
# divide columns of the dataset into families, according to the nature of their data
columns = {}
columns['all'] = set(dataset.columns)

# numbers: positive integers
columns['int_non_zero'] = set(['Id', 'LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice'])
apartmentRules.add_rule(columns['int_non_zero'], lambda x: x > 0)

columns['int'] = set(['EnclosedPorch', 'MasVnrArea', 'LotFrontage', '2ndFlrSF', 'PoolArea', 'OpenPorchSF', 'TotalBsmtSF', 'WoodDeckSF', 'ScreenPorch', 'BsmtFinSF1', 'MiscVal', 'GarageArea', 'LowQualFinSF', '3SsnPorch', 'BsmtFinSF2', 'BsmtUnfSF'])
apartmentRules.add_rule(columns['int'], lambda x: x >= 0)

# small numbers: again, positive integers
columns['small_int'] = set(['BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'KitchenAbvGr', 'BedroomAbvGr'])
apartmentRules.add_rule(columns['small_int'], lambda x: x >= 0)

# years
columns['year'] = set(['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold'])
# no useful rules can be defined for this class of fields

# months of the year (numerical)
columns['month'] = set(['MoSold'])
apartmentRules.add_rule(columns['month'], lambda x: (x >= 1) & (x <= 12))  

Once the integrity rules have been defined, we can check if there are some incongruencies inside the dataset

In [34]:
apartmentRules.spot_wrong_records()
Field: MSZoning
Wrong values: ['C (all)']

Field: Neighborhood
Wrong values: ['NAmes']

Field: BldgType
Wrong values: ['2fmCon', 'Duplex', 'Twnhs']

Field: Exterior1st
Wrong values: ['Wd Sdng']

Field: Exterior2nd
Wrong values: ['Brk Cmn', 'CmentBd', 'Wd Sdng', 'Wd Shng']

Now, we would like to understand if some of those errors are recoverable (eg: incongruencies with the syntax declared inside the documentation) or if they need to be deleted

MSZoning

In [35]:
# let us consider the 'MSZoning' field
print('Current values: {}'.format(dataset['MSZoning'].unique()))
print('Documentation approves: {}'.format(list(features_metadata['MSZoning']['values'].keys())))
Current values: ['RL' 'RM' 'C (all)' 'FV' 'RH']
Documentation approves: ['FV', 'RL', 'A', 'C', 'RP', 'I', 'RH', 'RM']

It looks like that "C" and "C (all)" could be actually the same value: let us check it

In [36]:
features_metadata['MSZoning']['values']['C']['description']
Out[36]:
'Commercial'

We think they have the same meaning: for this reason, every "C (all)" value inside the dataset will be converted as "C"

In [37]:
dataset['MSZoning'] = dataset['MSZoning'].map(lambda x: 'C' if x == 'C (all)' else x)

Neighborhood

In [38]:
# let us consider the 'Neighborhood' field
print('Current values: {}'.format(dataset['Neighborhood'].unique()))
print('Documentation approves: {}'.format(list(features_metadata['Neighborhood']['values'].keys())))
Current values: ['CollgCr' 'Veenker' 'Crawfor' 'NoRidge' 'Mitchel' 'Somerst' 'NWAmes'
 'OldTown' 'BrkSide' 'Sawyer' 'NridgHt' 'NAmes' 'SawyerW' 'IDOTRR'
 'MeadowV' 'Edwards' 'Timber' 'Gilbert' 'StoneBr' 'ClearCr' 'NPkVill'
 'Blmngtn' 'BrDale' 'SWISU' 'Blueste']
Documentation approves: ['Crawfor', 'Mitchel', 'Veenker', 'Somerst', 'SawyerW', 'Blmngtn', 'NPkVill', 'OldTown', 'BrkSide', 'BrDale', 'ClearCr', 'NoRidge', 'Gilbert', 'NridgHt', 'Timber', 'CollgCr', 'Sawyer', 'Names', 'StoneBr', 'Edwards', 'MeadowV', 'NWAmes', 'SWISU', 'IDOTRR', 'Blueste']

This is clearly a syntax error: the correct value should be "Names" (instead of "NAmes"). We are going to fix it

In [39]:
dataset['Neighborhood'] = dataset['Neighborhood'].map(lambda x: 'Names' if x == 'NAmes' else x)

BldgType

In [40]:
# let us consider the 'BldgType' field
print('Current values: {}'.format(dataset['BldgType'].unique()))
print('Documentation approves: {}'.format(list(features_metadata['BldgType']['values'].keys())))
Current values: ['1Fam' '2fmCon' 'Duplex' 'TwnhsE' 'Twnhs']
Documentation approves: ['1Fam', 'TwnhsI', 'TwnhsE', '2FmCon', 'Duplx']

While '2fmCon' and 'Duplex' can be easily seen as syntax error, one may argue that 'Twnhs' could refer both to 'TwnhsE' or 'TwnhsI'. Anyway, we think that the desired one is 'TwnhsI', because the other one is already present inside the dataset. We proceed with the data cleaning

In [41]:
d = {
    '2fmCon': '2FmCon',
    'Duplex': 'Duplx',
    'Twnhs': 'TwnhsI'
}

dataset['BldgType'] = dataset['BldgType'].map(lambda x: d[x] if x in d else x)

Exterior1st

In [42]:
# let us consider the 'Exterior1st' field
print('Current values: {}'.format(dataset['Exterior1st'].unique()))
print('Documentation approves: {}'.format(list(features_metadata['Exterior1st']['values'].keys())))
Current values: ['VinylSd' 'MetalSd' 'Wd Sdng' 'HdBoard' 'BrkFace' 'WdShing' 'CemntBd'
 'Plywood' 'AsbShng' 'Stucco' 'BrkComm' 'AsphShn' 'Stone' 'ImStucc']
Documentation approves: ['CBlock', 'BrkComm', 'VinylSd', 'AsbShng', 'Plywood', 'Stone', 'CemntBd', 'WdSdng', 'Other', 'PreCast', 'WdShing', 'ImStucc', 'BrkFace', 'MetalSd', 'HdBoard', 'AsphShn', 'Stucco']

Again, the value 'Wd Sdng' was another incongruency (should be 'WdSdng'): correct it

In [43]:
dataset['Exterior1st'] = dataset['Exterior1st'].map(lambda x: 'WdSdng' if x == 'Wd Sdng' else x)

Exterior2nd

In [44]:
# let us consider the 'Exterior2nd' field
print('Current values: {}'.format(dataset['Exterior2nd'].unique()))
print('Documentation approves: {}'.format(list(features_metadata['Exterior2nd']['values'].keys())))
Current values: ['VinylSd' 'MetalSd' 'Wd Shng' 'HdBoard' 'Plywood' 'Wd Sdng' 'CmentBd'
 'BrkFace' 'Stucco' 'AsbShng' 'Brk Cmn' 'ImStucc' 'AsphShn' 'Stone'
 'Other']
Documentation approves: ['CBlock', 'BrkComm', 'VinylSd', 'AsbShng', 'Plywood', 'Stone', 'CemntBd', 'WdSdng', 'Other', 'PreCast', 'WdShing', 'ImStucc', 'BrkFace', 'MetalSd', 'HdBoard', 'AsphShn', 'Stucco']

All the four spotted wrong values are actually incongruencies:

  • 'Brk Cmn' should be 'BrkComm'
  • 'CmentBd' should be 'CemntBd'
  • 'Wd Sdng' should be 'WdSdng'
  • 'Wd Shng' should be 'WdShing'
In [45]:
d = {
    'Brk Cmn': 'BrkComm',
    'CmentBd': 'CemntBd',
    'Wd Sdng': 'WdSdng',
    'Wd Shng': 'WdShing'
}

dataset['Exterior2nd'] = dataset['Exterior2nd'].map(lambda x: d[x] if x in d else x)

We can now see that all the dataset incongruency errors have been recovered

In [46]:
apartmentRules.spot_wrong_records()
Everything all right!

Test set wrong records

We repeat now the procedure for the test set, deleting eventual unexpected (because clearly wrong) contents. Integrity rules to apply are the same as before, so we do not need to reinstantiate a new rules dictionary:

In [47]:
# assign the rules to the test set
apartmentRules.set_dataset(testset)

# and see if there are wrong records
apartmentRules.spot_wrong_records()
Field: MSZoning
Wrong values: ['C (all)']

Field: Neighborhood
Wrong values: ['NAmes']

Field: BldgType
Wrong values: ['2fmCon', 'Duplex', 'Twnhs']

Field: Exterior1st
Wrong values: ['Wd Sdng']

Field: Exterior2nd
Wrong values: ['CmentBd', 'Wd Sdng', 'Wd Shng']

Field: Electrical
Wrong values: [nan]

Apart the "Electrical" feature, these incongruencies are the same spotted before for the dataset. We can then proceed to the cleaning in the same way as before

In [48]:
# MSZoning
testset['MSZoning'] = testset['MSZoning'].map(lambda x: 'C' if x == 'C (all)' else x)

# Neighborhood
testset['Neighborhood'] = testset['Neighborhood'].map(lambda x: 'Names' if x == 'NAmes' else x)

# BldgType
d = {
    '2fmCon': '2FmCon',
    'Duplex': 'Duplx',
    'Twnhs': 'TwnhsI'
}
testset['BldgType'] = testset['BldgType'].map(lambda x: d[x] if x in d else x)

# Exterior1st
testset['Exterior1st'] = testset['Exterior1st'].map(lambda x: 'WdSdng' if x == 'Wd Sdng' else x)

# Exterior2nd
d = {
    'CmentBd': 'CemntBd',
    'Wd Sdng': 'WdSdng',
    'Wd Shng': 'WdShing'
}
testset['Exterior2nd'] = testset['Exterior2nd'].map(lambda x: d[x] if x in d else x)

Electrical

The test set records cannot be deleted: we need to keep it because we are required to predict the SalePrice for all of them.

So, we have to find a way to recover the incongruency for the only field who present "NaN" as a value for this feature.
Note that, if the problem would have appeared in the dataset, this record could have simply been deleted

In [49]:
# the incongruent record
testset[testset['Electrical'].isna()]
Out[49]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition missing_garage
179 1380 80 RL 73.0 9735 Pave NA Reg Lvl AllPub ... 0 NA NA NA 0 5 2008 WD Normal False

1 rows × 81 columns

In [50]:
# existing values for this feature
sns.barplot(x=testset['Electrical'].index, y=testset['Electrical'])
Out[50]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f728bbf0cc0>
In [51]:
# documentation about this feature
PrintFeatureMetadata('Electrical')
Feature name: Electrical
--- Continuous: False
--- Description: Electrical system


  values                                        description
0  FuseA  Fuse Box over 60 AMP and all Romex wiring (Ave...
1  FuseF     60 AMP Fuse Box and mostly Romex wiring (Fair)
2  FuseP  60 AMP Fuse Box and mostly knob & tube wiring ...
3    Mix                                              Mixed
4  SBrkr                  Standard Circuit Breakers & Romex


It looks like none of the possible values are able to explicit that no information is available about the electrical system (or that the system is missing).
We decide to fix the flaw replacing "NaN" with "FuseF", which is the most common value for this feature inside the test set

In [52]:
testset['Electrical'] = testset['Electrical'].map(lambda x: 'FuseF' if x is np.NaN else x)
In [53]:
# check that all the incongruency errors have been recovered
apartmentRules.spot_wrong_records()
Everything all right!

Both the dataset and the test set are now compliant with the standard.

In [54]:
# we do not need the rules dictionary anymore
apartmentRules.destroy()


Remap categorical features

The goal now is to convert qualitative features (string type fields, or ratings) into boolean dummy ones.

In other words, these features will be converted in as many boolean features as their possible values described inside the documentation. The original predictors (categorical) will be deleted. In order to speed up this process, we define a proper method.

In [55]:
def remap_categorical_fields(data):
    '''
    This function convert several categorical fields as numerical ratings or boolean values.
    In order to do that, it is required that field's categories can be compared each other (they values can be sorted by weight/importance)
    '''
    
    def process_cols(cols_list, dictionary, compulsory=True):
        '''
        Transform a list of columns (fields) according to the values mapped by the dictionary
        '''
        for c in cols_list:
            if not compulsory:
                # create a boolean field to indicate if the value is missing or not
                data['{}_missing'.format(c)] = data[c].map(lambda x: x == 'NA')
            # remap the values
            data[c] = data[c].map(lambda x: dictionary[x])
    
    
    # "CentralAir" is a boolean field
    d = {'Y': 1, 'N': 0}
    l = ['CentralAir']
    process_cols(l, d)
    
    # "Id" is not of interest
    data = data.drop(['Id'], axis=1)
    
    # final result: dummy features
    return pd.get_dummies(data)

Process the dataset

In [56]:
# transform discrete-value columns as booleans
dataset = remap_categorical_fields(dataset)
In [57]:
# final result
dataset[:6]
Out[57]:
MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 ... SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 60 65.0 8450 7 5 2003 2003 196.0 706 0 ... 0 0 0 1 0 0 0 0 1 0
1 20 80.0 9600 6 8 1976 1976 0.0 978 0 ... 0 0 0 1 0 0 0 0 1 0
2 60 68.0 11250 7 5 2001 2002 162.0 486 0 ... 0 0 0 1 0 0 0 0 1 0
3 70 60.0 9550 7 5 1915 1970 0.0 216 0 ... 0 0 0 1 1 0 0 0 0 0
4 60 84.0 14260 8 5 2000 2000 350.0 655 0 ... 0 0 0 1 0 0 0 0 1 0
5 50 85.0 14115 5 5 1993 1995 0.0 732 0 ... 0 0 0 1 0 0 0 0 1 0

6 rows × 294 columns

Process the test set

In [58]:
# transform discrete-value columns as booleans
testset = remap_categorical_fields(testset)
In [59]:
# final result
testset[:6]
Out[59]:
MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 ... SaleType_COD SaleType_CWD SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 20 71.0 9353 4 5 1970 1970 0.0 0 0 ... 0 0 0 1 0 1 0 0 0 0
1 60 80.0 10400 7 5 1998 1998 0.0 0 0 ... 0 0 0 0 1 0 0 0 1 0
2 50 50.0 6000 5 8 1925 1997 0.0 0 0 ... 0 0 0 0 1 0 0 0 1 0
3 20 75.0 9750 7 5 2000 2001 171.0 0 0 ... 0 0 0 0 1 0 0 0 1 0
4 20 78.0 10140 5 6 1975 1975 0.0 788 0 ... 0 0 0 0 1 0 0 0 1 0
5 20 90.0 14684 7 7 1990 1991 234.0 485 177 ... 0 0 0 0 1 0 0 0 1 0

6 rows × 262 columns

In [60]:
# we are going to modify the data: keep track of the current DataFrames
dataset_complete = dataset.copy()
testset_complete = testset.copy()


Descriptive statistics about the data

Now, the dataset structure is ready and we can focus on analysing our data. The idea is to describe them statistically in order to discover feature properties and/or mutual relations, so that we can understand which model we can implement and how to do it in the best way.

To resume, we want to understand:

  • which kind of model we can/should implement (check that model assumptions hold)
  • which features are the most interesting (useful to predict the SalePrice)
  • discover eventual synergy effects between features

Is there a relation between the target and the predictors?

First of all, we want to understand if at least one of the features can help us to predict the SalePrice (that is, there is a relationship between these two variables).

More specifically, what we are going to check is a linear additive relationship. The way to proceed is to perform an hypotesis test, in which:

  • $H_0$ (null hyp.): no feature has a relationship with the SalePrice
  • $H_1$ (alternate hyp.): at least one feature has it.

This test is performed by computing the F-statistic:

$$ F = \frac{(TSS-RSS)\ /\ p}{RSS\ /\ (n-p-1)} $$

fitting a multiple linear regression model on the data

$$ \hat{y_i} = \hat{\beta_0} + \sum_{i=1}^p \hat{\beta_i} x_i $$

where: $$ TSS = \sum_{i=1}^n (y_i - \bar{y})^2 $$

$$ RSS = \sum_{i=1}^n (y_i - \hat{y_i})^2 $$

$$ \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i $$ $$ p = \text{number of dataset's features} $$ $$ n = \text{number of dataset's records} $$

In [61]:
from sklearn import linear_model


def features_and_target(apartments_df, target='SalePrice'):
    '''
    Split the apartments DataFrame in features (X) and targets (y)
    '''
    X = apartments_df.loc[:, apartments_df.columns != target]
    y = apartments_df[target]
    return X, y

def lmr_model_stats(X, Y, verbose=True):
    '''
    This method can be used to evaluate e linear multiple regression model on the fly and retrieve some performance
    statistics, about how it is able to fit the data and explain the variance of the response variables using the features
    
    Printed information are:
    - MSE
    - RSE
    - R2 score
    - F-statistics
    
    Input parameters are:
    - X: the data on which fit the model (features)
    - Y: the targets
    '''
    # number of records and number of features
    n, p = X.shape
    
    # evaluate the model (Multiple Linear Regression)
    lr = linear_model.LinearRegression()
    model = lr.fit(X, Y)

    # retrieve predictions: y^
    Y_pred = lr.predict(X)

    # compute MSE, RSE and R2 score
    mse = sklearn.metrics.mean_squared_error(Y, Y_pred)
    rse = np.sqrt(mse*n / (n-p-1))
    R2 = sklearn.metrics.r2_score(Y, Y_pred)

    # compute F-statistic
    y_mean = np.mean(Y)
    tmp = Y-y_mean
    TSS = np.dot(tmp, tmp)
    RSS = mse*n
    F = ((TSS-RSS)/p) / (RSS/(n-p-1))

    # results
    if verbose:
        print('MSE = {}'.format(mse))
        print('RSE = {}'.format(rse))
        print('R2 = {}'.format(R2))
        print('F-statistic: {}'.format(F))
    else:
        return mse, rse, R2, F
In [62]:
# isolate features and target
X, y = features_and_target(dataset)
# evaluate model performance statistics
lmr_model_stats(X, y)
MSE = 419475386.5784634
RSE = 23571.09540336471
R2 = 0.9361238891746472
F-statistic: 45.316403992939875

$R^2$ and RSE (Residual Standard Error) are numerical values we can use to score and interpret the results our model can produce. As a general rule, $R^2$ should be as high as possible (its possible values are between 0 and 1, both included) while RSE should be low (its minimum possible value is 0).
They can be used to compare several models, built upon several opearting conditions, and chose the best one.

The F-statistic is a number equal or greater than 1: the higher it is, the more confident we are in rejecting the null hypotesis. The fact we obtained more than 47 allows us to think that at least one of the dataset's features could be used to predict the target.

We feel that the multiple linear regression could be an interesting model choice, but we need to verify that its underlying assumptions are met.

Which are the most interesting features?

We know that at least one of the feature has a relation with the response variable. So, the question now is: are all the features really useful?

We expect that only some of the features are really of interest when predicting the SalePrice: also, we are interested in reducing the dimensionality of the problem. The reason is that, focusing only on the most important predictors, we are able to generate a model:

  • faster (the procedure becomes computationally lighter)
  • with lower risk of overfitting (we get rid of redundant data)
  • with higher accuracy (we get rid of potential misleading data)

References

  • An introduction to the available feature selection methods inside the Scikit-Learn library (link)
  • The documentation for "ExtraTreesClassifier"

The reason we have chosen this method (the fourth, inside the article) to perform the feature selection is that we don't have to specify a priori the number of desired features. Also, it allows us to see the importance score for each predictor and have a global view of the situation

In [67]:
from sklearn.ensemble import ExtraTreesClassifier

def feature_importance(X, y, threshold=None, importances=None, method='extraTrees', figsize=(70, 10)):
    '''
    Feature extraction: repeat the analysis several times, and average the results.
    If a threshold is specified, it highlights only the feature whose importance is greater than the specified value
    If the importances are already given, the computation is skipped.
    
    "method" is chosen accordingly to the technique we want to use to perform the feature selection
    - "extraTrees": use the ExtraTreesClassifier
    - "gradBoosting": use GradientBoostingRegressor
    
    Returns an array, whose values are the importance value associated to the "i-th" feature
    '''
    
    if method == 'extraTrees':
        N_trees = 100
        model = ExtraTreesClassifier(n_estimators=N_trees, bootstrap=True)
    elif method == 'gradBoosting':
        model = sklearn.ensemble.GradientBoostingRegressor()
    else:
        raise ValueError('The specified method is invalid. Try "extraTrees" or "gradBoosting".')
    
    # compute importance values
    if importances is None:
        res = None
        n = 25
        for i in range(n):
            model.fit(X, y)
            if res is None:
                res = model.feature_importances_
            else:
                res += model.feature_importances_
        res /= n
    else:
        res = importances
            
    # visualize results
    plt.figure(figsize=figsize)
    plt.title("Feature importance", fontsize=25, y=1.02)
    if threshold is None:
        # all the feature
        sns.barplot(x=X.columns, y=res)
    else:
        # only the most important ones
        colorLight = '#81ecec'
        colorDark = '#00b894'
        colors = [colorDark if _y >= threshold else colorLight for _y in res]
        # plot the bars and the threshold
        sns.barplot(x=X.columns, y=res, palette=colors)
        plt.plot(X.columns, [threshold]*len(X.columns), 'c-')

    plt.xticks(rotation='90')
    plt.grid(True)
    
    # return importance values
    return res
Thank to the methods of the scikit-learn library, we are able to visualize the relative importance every feature has to predict the SalePrice value.
These scores are normalized: their sum is equal to 1
In [68]:
res = feature_importance(X, y)
You can find the full resolution image here

We need now to decide which features we want to keep: the idea is to drop the features whose importance is lower than twice the importance mean value

In [69]:
m = np.mean(res)
print(m)
0.0034129692832764505
In [70]:
colorThreshold = 2*m
_ = feature_importance(X, y, colorThreshold, res)
You can find the full resolution image here
In [71]:
n_total = len(res)
n_selected = (res > colorThreshold).sum()
w_selected = res[res > colorThreshold].sum()

print('Original number of features: {}'.format(n_total))
print('Selected features: {} ({} %)'.format(n_selected, round(n_selected/n_total*100, 2)))
print('Selected features total weight: {}'.format(round(w_selected, 2)))
Original number of features: 293
Selected features: 41 (13.99 %)
Selected features total weight: 0.47
In [72]:
# drop the non-interesting features
dataset = dataset.filter(items=np.append(X.columns[res > colorThreshold].values, 'SalePrice'))
X = X.filter(items=X.columns[res > colorThreshold])

How would the model perform now?

Let us recompute the model performance metrics, about how it is able to fit the data and explain the variance of the target using the predictors

In [73]:
# evaluate model performance statistics
lmr_model_stats(X, y)
MSE = 935918705.4700272
RSE = 31142.63867260217
R2 = 0.8574818717213536
F-statistic: 169.93371033317615

As a consequence of reducing the number of considered features, we can see that the model overfits less the data. Because of that, MSE and $R^2$ score are worse: anyway, we can see that filtering a subset of features guarantees us more meaningful results (the F-statistic is much more higher in this case).

Collinearity analysis

The reason of this analysis is that the presence of collinearity (in other words: high correlation between features) can pose problems in the regression context. In fact, it makes difficult to separate out the individual effects of collinear variables on the response (target)

The features correlation matrix

Now, we are going to compute and analyze the correlation matrix between the features.

In [74]:
# compute the correlation matrix
correlation_df = dataset.corr()
In [75]:
# define a mask for the upper triangular value of the matrix (it's simmetric, with the diagonal being 1)
mask = np.zeros_like(correlation_df)
mask[np.triu_indices_from(mask)] = True

# plot the correlation matrix
plt.figure(figsize=[40,40])
_ = sns.heatmap(correlation_df, mask=mask, vmin=-1, vmax=1, square=True, linewidths=.6, \
                annot=True, fmt=".3f", cmap="coolwarm", cbar_kws={"shrink": 0.5})
You can find the full resolution image here

The obtained correlation matrix is huge, but a first look is enough to understand that several features are actually correlated each other (blue and red cells).

Now, consider only the correlation between each feature and the target variable

In [76]:
# prepare the dataframe from the series returned by correlation_df['SalePrice']
sale_price_corr_df = pd.DataFrame(correlation_df['SalePrice']).rename(columns={'SalePrice': 'Correlation'})
sale_price_corr_df['Features'] = sale_price_corr_df.index
# remove SalePrice among the features
sale_price_corr_df = sale_price_corr_df[sale_price_corr_df.Features != 'SalePrice']
In [77]:
# change the color for the highest bars
colorLight = '#7FB3D5'
colorDark = '#0984e3'
colorThreshold = 0.60
colors = [colorDark if abs(_y) >= colorThreshold else colorLight for _y in sale_price_corr_df['Correlation']]

# plot the graph
plt.figure(figsize=[20,8])
plt.title("SalePrice correlation between the other numerical features", fontsize=25, y=1.02)
ax = sns.barplot(x='Features', y='Correlation', data=sale_price_corr_df, palette=colors)
_, _ = plt.xticks(rotation='90')
plt.grid(True)
You can find the full resolution image here

As we can observe, we have derived some insights from the correlation. There are features like OverallQual, GrLivArea and GarageCars which have a high linear correlation between their values and the final price. That is, as this values increases also the price increases.

Let us focus on features with a correlation higher (in absolute value) than .60

In [78]:
correlated_features_df = sale_price_corr_df[sale_price_corr_df.Correlation > colorThreshold].sort_values('Correlation', ascending=0)
correlated_features_df
Out[78]:
Correlation Features
OverallQual 0.789142 OverallQual
GrLivArea 0.740379 GrLivArea
TotalBsmtSF 0.641917 TotalBsmtSF
GarageCars 0.640961 GarageCars
GarageArea 0.623329 GarageArea
1stFlrSF 0.618556 1stFlrSF
In [79]:
# correlated features
features = list(correlated_features_df.index)

correlated_df = dataset[features].corr()
# define a mask for the upper triangular value of the matrix (it's simmetric with the diagonal being 1)
mask = np.zeros_like(correlated_df)
mask[np.triu_indices_from(mask)] = True

plt.figure(figsize=[12,12])
_ = sns.heatmap(correlated_df, mask=mask, vmin=0.0, vmax=1, square=True, linewidths=.6, \
                annot=True, fmt=".3f", cmap="YlGnBu", cbar_kws={"shrink": 0.7})
_, _ = plt.yticks(rotation=0)
_, _ = plt.xticks(rotation=45)

The predictors the most correlated to the target are also correlated each other: potentially, it could mean that some of those features are not really interesting. It may be that some of these variables seems to be useful to predict the target not because they really are, but only because they are correlated to other useful variables.

We would like to get rid of these unnecessary predictors: anyway, we know that looking at the correlation matrix could not be sufficient to spot them. There may exist situations in which the collinearity involves three or more features, even if no pair of variables has a particularly high correlation. This situation is known as multicollinearity.

Because of the results we have observed, it seems that we should perform the collinearity analysis in order to check that all the requirements to build a linear regression model are really satisfied.

How to proceed: VIF values

A way to spot the problem is to evaluate the VIF (Variance Inflation Factor) for each feature

$$ \text{VIF}(\hat{\beta_j}) = \frac{1}{1 - R^2_{X_j | X_{-j}}} $$

where $ R^2_{X_j | X_{-j}} $ is the $ R^2 $ from a regression of $ X_j $ onto all of the other predictors. If this value is close to 1, collinearity is present and the VIF will be large.

The collinearity problem itself can be solved in two ways:

  • drop part of the problematic features
  • try to combine collinear variables together into a single predictor

Anyway, we are not able to merge together our features: they are quite heterogeneous and they are the result of a previous selection procedure.
So, the idea is to drop those variables for which the VIF value results to be high.

Remove problematic features

In general, a predictor is considered "problematic" if its VIF value is higher than 5

In [80]:
# we implement now methods to manage VIF values, inspired by the code found on the Internet
# https://stats.stackexchange.com/questions/155028/how-to-systematically-remove-collinear-variables-in-python

# docs: http://www.statsmodels.org/dev/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html
from statsmodels.stats.outliers_influence import variance_inflation_factor    


def collinearity_analysis(X, thresh=5, drop=True, verbose=False):
    '''
    Given a Pandas dataframe ("X"), we evaluate the VIF for every feature.
    If "drop" = True, problematic features are directly deleted: otherwise, they are just listed
    
    Note that a feature is considered problematic if its VIF is greater than "thresh" (which is 5 by default, 
    as recommended in the documentation for the "variance_inflation_factor" method)
    '''
    cols = X.columns
    variables = np.arange(X.shape[1])
    # list of problematic predictors
    problems = []
    # drop features from "variables" until every VIF value is ok
    dropped=True
    while dropped:
        dropped=False
        c = X[cols[variables]].values
        vif = [variance_inflation_factor(c, ix) for ix in np.arange(c.shape[1])]

        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            if drop and verbose:
                print('dropping \'' + X[cols[variables]].columns[maxloc] + '\' at index: ' + str(maxloc))
            problems.append(X[cols[variables]].columns[maxloc])
            variables = np.delete(variables, maxloc)
            dropped=True

    if drop:
        if verbose:
            print('Remaining variables:')
            print(X.columns[variables])
        return X[cols[variables]]
    else:
        return problems
We tried to run this test, but it seems that the majority of features have a VIF value much higher than 5. Also, when these features are removed the F-statistic for the model decrease to around 100.

We tried different threshold values, discovering that 60 is a value able to increase the F-statistic up to (more or less) 190. Anyway, we discarted that model: the one presented here has better performances on cross validation predictions over the validation set.

Here, you can see the F-statistic values obtained for the different thresholds

Let's continue

Linear dependency

It could be useful to understand if the dependency between features and target is linear or not. In order to check it, we can print the graph of the residuals (between targets and predictions) vs the predicted values: if the dependency is not linear, we could be able to see a graph who has not a linear behaviour

In [166]:
def residual_plot(y_pred, residuals, color="g", order=1):
    plt.figure(figsize=(14, 7))
    sns.regplot(x=y_pred, y=residuals, color=color, scatter_kws={"s": 80}, order=order, ci=None, truncate=True)
    plt.title('Residuals graph', size=18)
    plt.xlabel('Predicted SalePrice', size=14)
    plt.ylabel('Residual', size=14)
In [167]:
# evaluate the model (Multiple Linear Regression)
lr = linear_model.LinearRegression()
model = lr.fit(X, y)

# retrieve predictions: y^
Y_pred = lr.predict(X)

# evaluate the residuals
residuals = y - Y_pred
In [168]:
# plot the residual graph
residual_plot(Y_pred, residuals)
residual_plot(Y_pred, residuals, color='b', order=2)
It looks like we have quadratic dependencies: in fact, the regression line of order 2 is able to better fit the residuals (the green graph). We will consider it later, when building the model

Correlation of error terms

It is convenient to model error terms as uncorrelated random variables: to check if this assumption holds, we can see how the plot residuals vs dataset samples (tracking of the residuals) appear. If we are able to discern a pattern from plotted samples, it means we have correlation.

The reason we are going to perform this analysis is that, in case of correlated error terms, estimated standard errors would be underestimates of the true ones. Actually, this is a typical problem of time series (which is not our case) and we do not expect to have this kind of problem. Anyway, it could arise because of poor experimental design condition while collecting the data.

In [169]:
# plot the residuals tracking
plt.figure(figsize=(14,7))
plt.plot(residuals)
plt.title('Residuals tracking', size=18)
plt.xlabel('Dataset sample', size=14)
_ = plt.ylabel('Residual', size=14)

Fine. There is no evidence of error terms correlation

Heteroscedasticity

We want to see if the samples are not affected by heteroscedasticity: which means, that the variance associated to the various error components should be always the same (constant).

The way we can proceed to understand it, is to plot again the residuals graph: if the plot has a funnel shape, it means that the problem is present

In [170]:
# plot the residual graph
residual_plot(Y_pred, residuals, order=2)
We clearly have heteroscedasticity: we can solve the problem applying a concave function to transform the target. We choose to apply the square root function
In [172]:
# transform the response variable
Y2 = np.sqrt(y)
Y2_pred = np.sqrt(Y_pred)
residuals2 = Y2 - Y2_pred

# plot the residual graph
residual_plot(Y2_pred, residuals2, order=2)
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:3: RuntimeWarning: invalid value encountered in sqrt
  This is separate from the ipykernel package so we can avoid doing imports until

The problem is solved: now, plotted residuals have a more uniform disposal.

This is how the plot would have looked like using the natural logarithm:

In [173]:
# transform the response variable
Y2_log = np.log(y)
Y2_pred_log = np.log(Y_pred)
residuals2_log = Y2_log - Y2_pred_log

# plot the residual graph
residual_plot(Y2_pred_log, residuals2_log, color='r', order=2)
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:5: RuntimeWarning: invalid value encountered in log
  """

Outliers and high-leverage points

Before training the model, it is important to spot and remove outliers and high leverage points: in other words, points for which $y_i$ is very far from the value predicted by the model (outliers) or who have unusual values for the features (high leverage points).

Why outliers are a problem

Altough they may not influence how the linear regression model is built (but they could...), their presence makes the RSE (Residual Standard Error) to grow. Which means, we have problematics when we try to evaluate confidence intervals and p-values. Also the $R^2$ becomes worse.

They could be the result of a wrong observation, could arise because of anomalies or could simply represent phenomena that our model is not able to take into account. In a certain sense, they could be the evidence of the presence of flaws: so, we need to pay attention when we decide to drop the data.

Why high leverage points are a problem

They tend to have a sizable impact on how the regression line is evaluated: there is the possibility they invalidate the entire fit.

How to proceed

The Python library StatsModels provide a very powerful tool called Influence plot:

  • outliers are the points whose studentized residual is high (as a practical rule, it should normally be between -2 and +2)
  • high leverage points, instead, are characterized by high leverage associated values: they will be in the right-most part of the graph.
In [86]:
# to perform the analysis, we need to recompute the model specifically with StatsModel
lm = sm.OLS(Y2, X).fit()
In [87]:
# now, we can print the Influence plot
fig, ax = plt.subplots(figsize=(80,60))
fig = sm.graphics.influence_plot(lm, criterion="cooks", ax=ax)
plt.grid(True)
You can find the full resolution image here
In [62]:
# high leverage points: from right to left
hl_points = [
    313, 335, 249, 185, 635, 1173, 88, 934, 706, 170, 297, 197, 1031, 1009, 883, 523
]

# outliers
outliers = [
    970, 688, 1181, 142, 608, # top-left
    583, 496, 1182, # top
    714, 431, 560, 1062, 666, 874, 462, 410, 812, 968, 916, 588, 30, 632, # bottom-left
    523, #bottom
    495 #right
]

Example - The greatest outlier (record 523)

As an example, we try to understand why the apartment id=523 is an outlier

In [89]:
# a little analysis: the greatest outlier (523)
tmp = pd.options.display.max_columns
pd.options.display.max_columns = 999
print(dataset.iloc[523])
pd.options.display.max_columns = tmp
MSSubClass                60.0
LotFrontage              130.0
LotArea                40094.0
OverallQual               10.0
OverallCond                5.0
YearBuilt               2007.0
YearRemodAdd            2008.0
MasVnrArea               762.0
BsmtFinSF1              2260.0
BsmtUnfSF                878.0
TotalBsmtSF             3138.0
1stFlrSF                3138.0
2ndFlrSF                1538.0
GrLivArea               4676.0
BsmtFullBath               1.0
HalfBath                   1.0
BedroomAbvGr               3.0
TotRmsAbvGrd              11.0
Fireplaces                 1.0
GarageYrBlt             2007.0
GarageCars                 3.0
GarageArea               884.0
WoodDeckSF               208.0
OpenPorchSF              406.0
MoSold                    10.0
YrSold                  2007.0
LotShape_IR1               1.0
LotShape_Reg               0.0
LotConfig_Corner           0.0
LotConfig_Inside           1.0
HouseStyle_1Story          0.0
MasVnrType_BrkFace         0.0
MasVnrType_None            0.0
BsmtExposure_No            0.0
BsmtFinType1_GLQ           1.0
BsmtFinType1_Unf           0.0
HeatingQC_Ex               1.0
FireplaceQu_Gd             1.0
FireplaceQu_NA             0.0
GarageFinish_Fin           1.0
GarageFinish_RFn           0.0
SalePrice             184750.0
Name: 523, dtype: float64
In [90]:
# 184750
print('Avg SalePrice:')
print('{} (~184750)'.format(round(dataset_complete['SalePrice'].mean(), 2)))
Avg SalePrice:
181414.63 (~184750)

Apartment's SalePrice is in the average: but it looks like the building is quite recent and big, with the top "OverallQual"... maybe the price was too low?

In [91]:
# 2007
print('Avg yearBuilt:')
print('{} (< 2007)'.format(round(dataset_complete['YearBuilt'].mean(), 2)))
Avg yearBuilt:
1971.35 (< 2007)
In [92]:
# 40094
print('Avg LotArea:')
print('{} (< 40094)'.format(round(dataset_complete['LotArea'].mean(), 2)))
Avg LotArea:
10559.41 (< 40094)
In [93]:
# 10
print('Avg OverallQual:')
print('{} (< 10)'.format(round(dataset_complete['OverallQual'].mean(), 2)))
Avg OverallQual:
6.1 (< 10)
Condering the results obtained, we can clearly see that the price is too low

What we can think is that, maybe, we are not considering some important variables because not available (eg: the crime rate in the area where the apartment is located). It could also be a price forced by the economical crisis that started in the same year the apartment was sold: it may be that the owner, in economic need, had to sold it at a very convenient price.

Synergy effects

Sometimes, the target depends on the product of two or more features: which means, the model we are trying to build is not additive. This effect is called synergy and can be explained in this way: the response variable is affected by the fact that a bounch of features are acting together (at the same moment) or not.

We are not going to perform this kind of analysis: the reason is that with the number of features we have, it is not feasible to try to combine every couple of predictors as a new feature. We could have proceeded by attempts, randomly, but we decided to keep features as they are.


Data cleaning and preprocessing

What we are going to do now is try to fix all the problems spotted so far.

Heteroscedasticity

The first thing we do is to transform the response variable, applying the square root function. Now, our target variable will be "SalePrice_sqrt"

In [174]:
# store the original prices
original_target = dataset['SalePrice']
# work with the square root of the prices
dataset['SalePrice_sqrt'] = np.sqrt(dataset['SalePrice'])
dataset = dataset.drop(columns=['SalePrice'])

Outliers and high leverage points

We get rid of those points we have spotted before: they are potentially dangerous because they can significally alter how the regression model fits the data and reduce the model performances

In [ ]:
# high leverage points
dataset = dataset[~dataset.index.isin(hl_points)]
# outliers
dataset = dataset[~dataset.index.isin(outliers)]

Quadratic dependencies

We have spotted that some dependencies are actually non-linear: what we are going to do is to featurize the records, creating the square value for every quantitative feature. Then, we will perform the feature selection to get rid of those features that actually are not interesting in predicting the target.

In [176]:
for feature in dataset:
    # ignore the target
    if feature != 'SalePrice_sqrt':
        # consider only quantitative features
        if feature in features_metadata and features_metadata[feature]['continuous']:
            dataset['{}^2'.format(feature)] = dataset[feature]**2
In [177]:
print('We have now {} features'.format(dataset.shape[1]))
We have now 65 features

Residuals plot

Let us see how the residual plot looks like:

In [179]:
X, y = features_and_target(dataset, target='SalePrice_sqrt')

# evaluate the model (Multiple Linear Regression)
lr = linear_model.LinearRegression()
model = lr.fit(X, y)

# retrieve predictions: y^
Y_pred = lr.predict(X)

# evaluate the residuals
residuals = y - Y_pred
# plot them
residual_plot(Y_pred, residuals)
In [180]:
dataset.shape
Out[180]:
(1161, 65)
It looks like we can try to build a Linear Multiple Regression model: using the selected features, for 1161 out of 1200 records the model seems able to give an overview of the situation

The analysis is terminated: we want to reset the "SalePrice" field as our target

In [181]:
dataset['SalePrice'] = original_target
dataset = dataset.drop(columns='SalePrice_sqrt')


Building the model

Our strategy

Our starting point will be the Multiple Linear regression model developed so far. In order to try to improve its performances, we want to explore other kind of models and feature selection techniques.
Models obtained will be compared each other: the best ones will be merged together to create a complex model able to merge their predictions as an unique number.

In [67]:
from sklearn import ensemble, model_selection, decomposition, neighbors

In order to compute and parametrize the alternative models, we will use Spark. Our cluster can offer us a degree of parallelism equals to 12

In [63]:
parallelism = 12
# broadcast variables
global Xb
global yb

This function will be used to plot test errors

In [ ]:
def model_validation_results(y, y_pred, verbose=True):
    '''
    This method is used to inspect how accurate the predictions of a model are.
    In more details, the method prints to the user some information about the prediction errors and plot three graphs:
    - estimates and true values comparison
    - absolute errors
    - relative errors
    '''
    # compare predictions and test targets: evaluate the errors
    delta = np.abs(y - y_pred)
    abs_errors = list(delta)
    rel_errors = list(delta / y * 100)
    # compute some model quality parameter
    mse = round(np.dot(delta, delta)/len(delta), 2)
    max_error = round(max(abs_errors), 2)
    min_error = round(min(abs_errors), 2)
    mean_error = round(np.mean(abs_errors), 2)
    max_rel_error = round(max(rel_errors), 2)
    min_rel_error = round(min(rel_errors), 2)
    mean_rel_error = round(np.mean(rel_errors), 2)

    # print results
    if verbose:
        print('MSE: {}'.format(mse))
        print('\nMax absolute error: {}'.format(max_error))
        print('Min absolute error: {}'.format(min_error))
        print('Avg absolute error: {}'.format(mean_error))
        print('\nMax relative error: {}%'.format(max_rel_error))
        print('Min relative error: {}%'.format(min_rel_error))
        print('Avg relative error: {}%'.format(mean_rel_error))

        figsize=(80, 8)
        N = len(y)

        # estimated vs true values
        plt.figure(figsize=figsize)
        plt.title('Estimates and True SalePrices comparison', fontsize=22, y=1.02)
        plt.plot(y_pred, 'r', label='Estimated prices')
        plt.plot(list(y), 'g', label='True prices')
        plt.xlabel('Apartment', fontsize=16)
        plt.ylabel('Estimated and true SalePrices, in $', fontsize=16)
        plt.xlim(0, N)
        plt.legend()
        plt.grid(True)

        # absolute errors obtained
        plt.figure(figsize=figsize)
        plt.title('SalePrice absolute errors obtained', fontsize=22, y=1.02)
        plt.plot(abs_errors)
        plt.xlabel('Apartment', fontsize=16)
        plt.ylabel('Abs. error, in $', fontsize=16)
        plt.xlim(0, N)
        plt.grid(True)

        # relative errors obtained
        plt.figure(figsize=figsize)
        plt.title('SalePrice relative errors obtained', fontsize=22, y=1.02)
        plt.plot(rel_errors, 'g')
        plt.xlabel('Apartment', fontsize=16)
        plt.ylabel('Rel. error, in %', fontsize=16)
        plt.xlim(0, N)
        plt.grid(True)

    # return results
    return mse, max_error, min_error, mean_error, abs_errors, rel_errors

Multiple Linear Regression

DOCS: here
Let us see how well would perform a simple MLR model: in particular, the one we have considered so far during our analysis

In [182]:
# build our model
X, y = features_and_target(dataset)
model_mlr = linear_model.LinearRegression()
# fit the model (applying the square root to the target variable)
y_pred = model_selection.cross_val_predict(model, X, np.sqrt(y))**2
# see the prediction error
_ = model_validation_results(y, y_pred)
MSE: 513527726.46

Max absolute error: 185625.94
Min absolute error: 8.25
Avg absolute error: 15269.93

Max relative error: 54.05%
Min relative error: 0.0%
Avg relative error: 8.58%
Here the full resolution images
In [183]:
# keep track of the features used by this model
features_mlr = X.columns
In [184]:
print(features_mlr)
Index(['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtUnfSF',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath',
       'HalfBath', 'BedroomAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',
       'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'MoSold',
       'YrSold', 'LotShape_IR1', 'LotShape_Reg', 'LotConfig_Corner',
       'LotConfig_Inside', 'HouseStyle_1Story', 'MasVnrType_BrkFace',
       'MasVnrType_None', 'BsmtExposure_No', 'BsmtFinType1_GLQ',
       'BsmtFinType1_Unf', 'HeatingQC_Ex', 'FireplaceQu_Gd', 'FireplaceQu_NA',
       'GarageFinish_Fin', 'GarageFinish_RFn', 'LotFrontage^2', 'LotArea^2',
       'YearBuilt^2', 'YearRemodAdd^2', 'MasVnrArea^2', 'BsmtFinSF1^2',
       'BsmtUnfSF^2', 'TotalBsmtSF^2', '1stFlrSF^2', '2ndFlrSF^2',
       'GrLivArea^2', 'BsmtFullBath^2', 'HalfBath^2', 'BedroomAbvGr^2',
       'TotRmsAbvGrd^2', 'Fireplaces^2', 'GarageYrBlt^2', 'GarageCars^2',
       'GarageArea^2', 'WoodDeckSF^2', 'OpenPorchSF^2', 'MoSold^2',
       'YrSold^2'],
      dtype='object')

We think the multiple linear regression model alone would perform quite poorly anyway. For this reason, we try to find some model alternatives and explore the sklearn-library

Alternative models

We would like to see if we are able to increase the prediction performances by averaging the estimates from several different models. Hence, our next step is exploring the models available inside the scikit-learn library: we will compare them and select the best ones.

Before starting, let's drop the outliers and high leverage points defining the base dataset versions on which we are going to work with

  • dataset_complete: contains all the features + all the apartments
  • dataset_complete_2: contains all the features and their squared values + all the apartments
  • dataset_cleaned: contains all the features, but no outliers/high leverage apartments
  • dataset_cleaned_2: contains all the features and their squared values, but no outliers/high leverage apartments
In [69]:
# create a version of the original dataset with squared features
dataset_complete_2 = dataset_complete.copy()

for feature in dataset_complete:
    # ignore the target
    if feature != 'SalePrice':
        # consider only quantitative features
        if feature in features_metadata and features_metadata[feature]['continuous']:
            dataset_complete_2['{}^2'.format(feature)] = dataset_complete_2[feature]**2
            
# filter out high leverage points and outliers
dataset_cleaned = dataset_complete.copy()
dataset_cleaned = dataset_cleaned[~dataset_cleaned.index.isin(outliers)]
dataset_cleaned = dataset_cleaned[~dataset_cleaned.index.isin(hl_points)]

# filter out high leverage points and outliers
dataset_cleaned_2 = dataset_complete_2.copy()
dataset_cleaned_2 = dataset_cleaned_2[~dataset_cleaned_2.index.isin(outliers)]
dataset_cleaned_2 = dataset_cleaned_2[~dataset_cleaned_2.index.isin(hl_points)]
The reason we can afford to replicate the dataset in memory 4 times is because the dataset is not very big: replicating it, we can speed up the model comparisons (without the need to repristinate the dataset variants in runtime, when needed)
In [70]:
def tuning_plot(params, model_scores, model_name, param_name):
    '''
    Plot a graph in which the obtained model scores are shown, according to its parameter's values
    '''
    plt.figure(figsize=(15, 8))
    plt.title('Tuning the model: {}'.format(model_name), fontsize=22, y=1.02)
    plt.plot(params, model_scores)
    plt.xlabel(param_name, fontsize=16)
    plt.ylabel('Model score', fontsize=16)
    plt.grid(True)

We now start to consider the alternatives

Gradient Boosting Regressor, using different features

We try to use the GradientBoostingRegressor, instead of the ExtraTreesClassifier, to determine the features importance values. We will use it also as regression model to fit the data: squared features are considered

DOCS: here

In [185]:
# isolate features and target
X, y = features_and_target(dataset_cleaned_2)
# feature importance with GradientBoostingRegressor
res = feature_importance(X, y, method='gradBoosting')
m = np.mean(res)
You can find the full resolution image here
In [143]:
# keep only interesting features
dataset = dataset_cleaned_2.filter(items=np.append(X.columns[res > 2*m].values, 'SalePrice'))
X, y = features_and_target(dataset)
Xb = sc.broadcast(X)
yb = sc.broadcast(y)
# selected features
features_gbr = X.columns

# learning rates
lrs = np.linspace(1e-3, 1, 100)
# create the instances for the models (Gradient Boosting)
models = list(map(lambda lr: ensemble.GradientBoostingRegressor(learning_rate=lr), lrs))
rdd = sc.parallelize(models, parallelism)
# score the models
scores = rdd.map(lambda mod: model_selection.cross_val_score(mod, Xb.value, yb.value).mean()).collect()
# find the best learning rate
best_lr_gbr = lrs[np.argmax(scores)]
print('Best learning rate found: {}\n'.format(best_lr_gbr))
# the best model obtained
model_gbr = models[np.argmax(scores)]

# plot the scores, according to the tuning parameter
tuning_plot(lrs, scores, 'Gradient Boosting Regressor', 'Learning rate ($\lambda$)')
Best learning rate found: 0.112

In [187]:
print(features_gbr)
Index(['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF', '2ndFlrSF',
       'GrLivArea', 'KitchenAbvGr', 'GarageYrBlt', 'GarageCars', 'GarageArea',
       'WoodDeckSF', 'OpenPorchSF', 'ScreenPorch', 'MoSold', 'MSZoning_FV',
       'LotConfig_CulDSac', 'Neighborhood_Crawfor', 'Neighborhood_Mitchel',
       'Condition1_Artery', 'Exterior1st_BrkFace', 'BsmtQual_Ex',
       'BsmtExposure_Gd', 'BsmtFinType1_GLQ', 'KitchenQual_Ex',
       'Functional_Typ', 'SaleType_New', 'SaleCondition_Abnorml',
       'LotFrontage^2', 'LotArea^2', 'YearBuilt^2', 'YearRemodAdd^2',
       'BsmtFinSF1^2', 'BsmtUnfSF^2', 'TotalBsmtSF^2', '1stFlrSF^2',
       '2ndFlrSF^2', 'GrLivArea^2', 'GarageYrBlt^2', 'GarageCars^2',
       'GarageArea^2', 'WoodDeckSF^2', 'OpenPorchSF^2', 'ScreenPorch^2',
       'MoSold^2'],
      dtype='object')

Principal Component Analysis

Instead of selecting a subset of the original features, we try to select the best "K" principal components and use them as new predictors.
The value of "K" is tuned using cross-validation.

DOCS: here

In [217]:
# isolate features and targets
X, y = features_and_target(dataset_cleaned_2)
Xb = sc.broadcast(X)
yb = sc.broadcast(y)

# number of principal components
Ns = list(range(1, 21))
# instanciate PCAs
PCAs = list(map(lambda n: decomposition.PCA(n_components=n), Ns))
rdd = sc.parallelize(PCAs, parallelism)
# apply PCA decomposition and score the LMR models
scores = rdd.map(lambda pca: model_selection.cross_val_score(
                                                linear_model.LinearRegression(),
                                                pca.fit_transform(Xb.value),
                                                yb.value
                                            ).mean()
                ).collect()
# find the best learning rate
best_n_pca = Ns[np.argmax(scores)]
print('Best N-value found: {}\n'.format(best_n_pca))

# plot the scores, according to the tuning parameter
tuning_plot(Ns, scores, 'PCA decomposition', 'Number of components')
Best N-value found: 14

K-Nearest Neighbors Regressor

The KNN Regression model is more flexible than the Multiple Linear Regression one: it means that, when the relation between features and target is highly non-linear, this method can be much more performant. We want to try it and see how it performs.

We are going to work with a dataset

  • with the features selected by the MLR model
  • with no outliers and HL-points

DOCS: here

In [144]:
# keep only interesting features
dataset = dataset_cleaned_2.filter(items=np.append(features_mlr, 'SalePrice'))
X, y = features_and_target(dataset)
Xb = sc.broadcast(X)
yb = sc.broadcast(y)

# K values
Ks = list(range(1, 11))
# create the instances for the models (KNN)
models = list(map(lambda k: neighbors.KNeighborsRegressor(n_neighbors=k), Ks))
rdd = sc.parallelize(models, parallelism)
# score the models
scores = rdd.map(lambda mod: model_selection.cross_val_score(mod, Xb.value, yb.value).mean()).collect()
# find the best K value
best_k_knn = Ks[np.argmax(scores)]
print('Best K value found: {}\n'.format(best_k_knn))
# the best model obtained
model_knn = models[np.argmax(scores)]

# plot the scores, according to the tuning parameter
tuning_plot(Ks, scores, 'K-Nearest Neighbors Regressor', 'N. neighbors (K)')
Best K value found: 5

Lasso

This technique works in a similar way to the Multiple Linear Regression model, but:

  • it can automatically detect which are the most useful features when predicting the target (ignoring the other ones)
  • the loss function is regularized (with an L1 penalty)

The predictions are retrieved according to this formula $$ \hat{y_i} = \beta_0 + \sum_{j=1}^p \beta_jx_{ij} $$

And the loss function becomes: $$ \sum_{i=1}^n (y_i - \hat{y_i})^2 + \lambda \sum_{j=1}^p |\beta_j| $$

Where:

  • $\hat{y_i}$ is the i-th prediction, fot the dataset record $x_i$
  • $y_i$ is the i-th target, fot the dataset record $x_i$
  • $\beta_i$ is the i-th regression coefficient
  • $p$ is the number of features we have
  • $n$ is the number of records inside the dataset
  • $\lambda$ is our tuning parameter (used to determine how many features we want to keep)

We are going to use the dataset with all the features (also the square values), but no outliers and HL-points

DOCS: here

In [149]:
X, y = features_and_target(dataset_cleaned_2)
Xb = sc.broadcast(X)
yb = sc.broadcast(y)

# learning rates
lrs = np.linspace(0.7, 70, 100)
# create the instances for the models (Lasso)
models = list(map(lambda lr: linear_model.Lasso(alpha=lr), lrs))
rdd = sc.parallelize(models, parallelism)
# score the models
scores = rdd.map(lambda mod: model_selection.cross_val_score(mod, Xb.value, yb.value).mean()).collect()
# find the best learning rate
best_lr_las = lrs[np.argmax(scores)]
print('Best learning rate found: {}\n'.format(best_lr_las))
# the best model obtained
model_las = models[np.argmax(scores)]

# plot the scores, according to the tuning parameter
tuning_plot(lrs, scores, 'Lasso', 'Penalty coefficient ($\lambda$)')
Best learning rate found: 21.7

Ridge Regression

A technique very similar to Lasso: the differences are that

  • all the features are used to compute the prediction
  • the shrinkage penalty added to the loss function is type L2

So, the loss function now becomes: $$ \sum_{i=1}^n (y_i - \hat{y_i})^2 + \lambda \sum_{j=1}^p \beta_j^2 $$

We are going to use the dataset with all the features (also the square values), but no outliers and HL-points

DOCS: here

In [150]:
X, y = features_and_target(dataset_cleaned_2)
Xb = sc.broadcast(X)
yb = sc.broadcast(y)

# learning rates
lrs = np.linspace(0.7, 70, 100)
# create the instances for the models (Ridge)
models = list(map(lambda lr: linear_model.Ridge(alpha=lr), lrs))
rdd = sc.parallelize(models, parallelism)
# score the models
scores = rdd.map(lambda mod: model_selection.cross_val_score(mod, Xb.value, yb.value).mean()).collect()
# find the best learning rate
best_lr_rid = lrs[np.argmax(scores)]
print('Best learning rate found: {}\n'.format(best_lr_rid))
# the best model obtained
model_rid = models[np.argmax(scores)]

# plot the scores, according to the tuning parameter
tuning_plot(lrs, scores, 'Ridge Regression', 'Penalty coefficient ($\lambda$)')
Best learning rate found: 10.499999999999998

Bagging Regressor

We are now starting to explore ensemble methods: the first one we analyse is the Bagging Regression, built combining together Decision Tree Regressors.

Bagging's objective is trying to reduce the final model's error variance: the base regressors (Decision Trees, in our case) work on different portions of the dataset, fitting their own model. When performing the prediction task, all the predictions coming from these models are averaged together.

We are going to use the dataset with all the features (also the square values), but no outliers and HL-points

DOCS: here

In [71]:
X, y = features_and_target(dataset_cleaned_2)
Xb = sc.broadcast(X)
yb = sc.broadcast(y)

# number of base regressors to use
Ns = list(range(50, 5000, 100))
# create the instances for the models (Bagging)
models = list(map(lambda n: ensemble.BaggingRegressor(bootstrap=True, n_estimators=n), Ns))
rdd = sc.parallelize(models, parallelism)
# score the models
scores = rdd.map(lambda mod: model_selection.cross_val_score(mod, Xb.value, yb.value).mean()).collect()
# find the best number of base regressors
best_n_bag = Ns[np.argmax(scores)]
print('Best N-value found: {}\n'.format(best_n_bag))
# the best model obtained
model_bag = models[np.argmax(scores)]

# plot the scores, according to the tuning parameter
tuning_plot(Ns, scores, 'Bagging Regressor', 'Number of base regressors (n)')
Best N-value found: 2350

AdaBoost Regressor

We now consider is the AdaBoost Regression built combining together Decision Tree Regressors.

AdaBoost is a famous boosting technique, whose objective is trying to reduce the final model's error variance. it is something similar to bagging, but here trees are built sequentially. The shrinkage parameter ($\lambda$) has to be tuned properly.

We are going to use the dataset with all the features (also the square values), but no outliers and HL-points. The number of base estimators will be the same found for the Bagging Regressor

DOCS: here

In [83]:
X, y = features_and_target(dataset_cleaned_2)
Xb = sc.broadcast(X)
yb = sc.broadcast(y)

# learning rates to try
lrs = np.linspace(1e-3, 2, 100)
# create the instances for the models (AdaBoost)
models = list(map(lambda lr: ensemble.AdaBoostRegressor(learning_rate=lr, n_estimators=best_n_bag), lrs))
rdd = sc.parallelize(models, parallelism)
# score the models
scores = rdd.map(lambda mod: model_selection.cross_val_score(mod, Xb.value, yb.value).mean()).collect()
# find the best learning rate
best_lr_ada = lrs[np.argmax(scores)]
print('Best learning rate found: {}\n'.format(best_lr_ada))
# the best model obtained
model_ada = models[np.argmax(scores)]

# plot the scores, according to the tuning parameter
tuning_plot(lrs, scores, 'AdaBoost Regressor', 'Learning rate ($\lambda$)')
Best learning rate found: 0.041383838383838384

Random Forest Regressor

This methods use a forest (set of Decision Trees) to perform the Regression: base regressors predictions are averaged together.

We are going to use the dataset with all the features (also the square values), but no outliers and HL-points. The tuning parameter is the number of base regressors to instanciate.

DOCS: here

In [87]:
X, y = features_and_target(dataset_cleaned_2)
Xb = sc.broadcast(X)
yb = sc.broadcast(y)

# learning rates to try
Ns = list(range(10, 10000, 100))
# create the instances for the models (Random Forest)
models = list(map(lambda n: ensemble.ExtraTreesRegressor(bootstrap=True, n_estimators=n), Ns))
rdd = sc.parallelize(models, parallelism)
# score the models
scores = rdd.map(lambda mod: model_selection.cross_val_score(mod, Xb.value, yb.value).mean()).collect()
# find the best learning rate
best_n_rfr = Ns[np.argmax(scores)]
print('Best N-value found: {}\n'.format(best_n_rfr))
# the best model obtained
model_rfr = models[np.argmax(scores)]

# plot the scores, according to the tuning parameter
tuning_plot(Ns, scores, 'Random Forest Regressor', 'Number of base regressors (n)')
Best N-value found: 410

Extra Trees Regressor

A technique similar to the Random Forset one: the difference is how the dataset is split when assigned to the base regressors. Here, the thresholds for candidate features are more randomic.

We are going to use the dataset with all the features (also the square values), but no outliers and HL-points. The tuning parameter is the number of base regressors to instanciate.

DOCS: here

In [71]:
X, y = features_and_target(dataset_cleaned_2)
Xb = sc.broadcast(X)
yb = sc.broadcast(y)

# learning rates to try
Ns = list(range(10, 10000, 100))
# create the instances for the models (Bagging)
models = list(map(lambda n: ensemble.RandomForestRegressor(bootstrap=True, n_estimators=n), Ns))
rdd = sc.parallelize(models, parallelism)
# score the models
scores = rdd.map(lambda mod: model_selection.cross_val_score(mod, Xb.value, yb.value).mean()).collect()
# find the best learning rate
best_n_etr = Ns[np.argmax(scores)]
print('Best N-value found: {}\n'.format(best_n_etr))
# the best model obtained
model_etr = models[np.argmax(scores)]

# plot the scores, according to the tuning parameter
tuning_plot(Ns, scores, 'Extremely Randomized Trees Regressor', 'Number of base regressors (n)')
Best N-value found: 2110

Models comparison

Our model alternatives are ready to be compared each other. In order to perform it efficiently, we create a wrapper class called CompareModels whose primary goal is to plot the $R^2$ cross-validation model scores.

The higher the score, the higher the model performances in predicting apartments SalePrices.

In [84]:
class CompareModels(object):
    '''
    This class can be used as a wrapper to compare alternative models. 
    The comparison can be done quickly by plotting the scores for each model.
    '''
    def __init__(self, X, y, n_batch=3):
        '''
        Instanciate the wrapper, assigning to it the dataset (split in X and y) 
        and deciding the number of batches to be used during cross-validation
        '''
        self.X = X
        self.y = y
        self.n_batch = n_batch
        self.models = []
    
    def add_model(self, model, name, f_X=None, f_y=None):
        '''
        Attach one model to the wrapper. You need to provide both the model instance 
        and the model name to display in the graph.
        Optionally, you can specify the functions to be applied a priori:
        - f_X, on predictors
        - f_y, on targets
        '''
        # if not specify, the two functions are the identify function (no input transformation)
        if f_X is None:
            f_X = lambda x: x
        if f_y is None:
            f_y = lambda y: y
        
        self.models.append(
            {
                'name': name,
                'instance': model,
                'f_X': f_X,
                'f_y': f_y,
                'score': None
            }
        )
    
    def score(self):
        '''
        Score every model assigned to the class performing cross-validation.
        Models with highest performances are the ones with highest scores (bars) inside the plot.
        '''
        # compute scores
        for m in self.models:
            # validate the model with cross-validation
            m['score'] = sklearn.model_selection.cross_val_score(m['instance'], m['f_X'](self.X), m['f_y'](self.y), cv=self.n_batch).mean()
        # plot results
        plt.figure(figsize=(15, 8))
        plt.title('Model comparisons', fontsize=22, y=1.02)
        sns.barplot(
            x = list(map(lambda m: m['name'], self.models)),
            y = list(map(lambda m: m['score'], self.models))
        )
        plt.xlabel('Regression model', fontsize=16)
        plt.ylabel('Cross-validation $R^2$ score', fontsize=16)
        plt.xticks(rotation='90')
        plt.grid(True)
In [72]:
# methods to apply in order to transform the dataset
def f_X_mlr(X):
    return X.filter(items=features_mlr)

def f_X_gbr(X):
    return X.filter(items=features_gbr)

def f_X_pca(X):
    return sklearn.decomposition.PCA(n_components=best_n_pca).fit_transform(X)
    
def f_X_knn(X):
    return X.filter(items=features_mlr)
In [224]:
# instanciate the wrapper
X, y = features_and_target(dataset_cleaned_2)
myModels = CompareModels(X, y)

# add the previous model alternatives to the wrapper
myModels.add_model(model_mlr, 'Multiple Linear Regression', f_X=f_X_mlr, f_y=np.sqrt)
myModels.add_model(model_gbr, 'Gradient Boosting ($\lambda$={})'.format(best_lr_gbr), f_X=f_X_gbr)
myModels.add_model(model_pca, 'PCA + Multiple Linear Regression', f_X=f_X_pca)
myModels.add_model(model_knn, 'K-Nearest Neighbor (K={})'.format(best_k_knn), f_X=f_X_knn)
myModels.add_model(model_las, 'Lasso ($\lambda$={})'.format(best_lr_las))
myModels.add_model(model_rid, 'Ridge ($\lambda$={})'.format(best_lr_rid))
myModels.add_model(model_bag, 'Bagging (n={})'.format(best_n_bag))
myModels.add_model(model_ada, 'AdaBoost (n={}, $\lambda$={})'.format(best_n_bag, round(best_lr_ada, 4)))
myModels.add_model(model_rfr, 'Random Forest (n={})'.format(best_n_rfr))
myModels.add_model(model_etr, 'Extra Trees (n={})'.format(best_n_etr))
In [ ]:
# score the models and compare them each other
myModels.score()

In [233]:
# see the top-6 models and their scores
for m in sorted(myModels.models, key=lambda x: x['score'], reverse=True)[:6]:
    print('{}: {}'.format(m['name'], m['score']))
Ridge ($\lambda$=10.5): 0.9175522798297749
Multiple Linear Regression: 0.9170442646278719
Gradient Boosting ($\lambda$=0.112): 0.9128329290110689
Lasso ($\lambda$=21.7): 0.9052626448380366
Bagging (n=1550): 0.8900496367053027
Extra Trees (n=2110): 0.8895333692160445


Here we are: we can see that the 6 most performant models are:

  • the Ridge model
  • the Multiple Linear Reggression model built at the start of the notebook
  • the Gradient Boosting Regressor
  • the Lasso model
  • the Bagging Regressor
  • the Extra Trees Regressor

The final model

We have analysed several regression model alternatives: we have created them and we have selected only the ones with best performances. The next step is to merge these models together, creating our final model to be used for SalePrice predictions.

Deciding the weights

The final prediction will be the weighted average of the predictions given by the base regressor we have chosen. In order to decide the weights, the idea is to use the base predictions as predictors for the target.
In this way, we will be able to see which of these model has major impact in order to establish the final prediction.

The easiest way to proceed would be to simply average these values, assigning each weight the same number. Anyway, we decide to proceed following a more general approach.
In our code, feature importance is evaluated using the Extra Trees class from scikit-learn.

In [ ]:
preds_as_X = pd.DataFrame({
    'Ridge': sklearn.model_selection.cross_val_predict(model_rid, X, y),
    'MLR': sklearn.model_selection.cross_val_predict(model_mlr, f_X_mlr(X), np.sqrt(y))**2,
    'GradBoost': sklearn.model_selection.cross_val_predict(model_gbr, f_X_gbr(X), y),
    'Lasso': sklearn.model_selection.cross_val_predict(model_las, X, y),
    'Bagging': sklearn.model_selection.cross_val_predict(model_bag, X, y),
    'ExtraTrees': sklearn.model_selection.cross_val_predict(model_etr, X, y)
})
In [81]:
# compute the weights as feature importances
final_model_weights = feature_importance(preds_as_X, y, figsize=(15, 8))
In [82]:
print(final_model_weights)
[0.1674276  0.16744832 0.16714102 0.16505617 0.16698126 0.16594564]

The Python class

This model will be realized as a class, called ApartmentPricePredictor

In [73]:
class ApartmentPricePredictor(object):
    '''
    This class is used to instantiate a model used to predict SalePrices. It is composed by six different base regressors, 
    whose predictions are averaged together as final result.
    These models are the six we have discovered above, with best performances.
    
    WARNING: not intended to work as a stand-alone class!
    In fact, internal values refers to variables whose values have been computed before (eg: the optimal models).
    '''
    def __init__(self):
        '''
        The six internal base regressors are created, using previous computed results
        '''
        # specify if the model has already been trained
        self.trained = False
        # Bagging
        self.bag = {
            'instance': model_bag,
            'model': None,
            'weight': final_model_weights[0]
        }
        # Extra Trees Regressor
        self.etr = {
            'instance': model_etr,
            'model': None,
            'weight': final_model_weights[1]
        }
        # Gradient Boosting Regressor
        self.gbr = {
            'instance': model_gbr,
            'model': None,
            'weight': final_model_weights[2]
        }
        # Lasso
        self.las = {
            'instance': model_las,
            'model': None,
            'weight': final_model_weights[3]
        }
        # Multiple Linear Regression model
        self.mlr = {
            'instance': model_mlr, 
            'model': None,
            'weight': final_model_weights[4]
        }
        # Ridge
        self.rid = {
            'instance': model_rid, 
            'model': None,
            'weight': final_model_weights[5]
        }
        
        
    def train(self, X, y):
        '''
        Train the model over the specified dataset
        '''
        # Bagging
        self.bag['model'] = self.bag['instance'].fit(X, y)
        # Extra Trees
        self.etr['model'] = self.etr['instance'].fit(X, y)
        # Gradient Boosting: select a subset of the features
        self.gbr['model'] = self.gbr['instance'].fit(f_X_gbr(X), y)
        # Lasso
        self.las['model'] = self.las['instance'].fit(X, y)
        # MLR model: select a subset of the features and works with the square root of the SalePrice
        self.mlr['model'] = self.mlr['instance'].fit(f_X_mlr(X), np.sqrt(y))
        # Ridge
        self.rid['model'] = self.rid['instance'].fit(X, y)
        # mark the model as trained
        self.trained = True
        
    
    def predict(self, X):
        '''
        Run predictions using the trained model
        '''
        # the model must have been trained
        if not self.trained:
            raise RuntimeError('The model has never been trained before!')
        
        # evaluate the predictions
        def _pred(m, X, f_X=None, f_y=None):
            '''
            Internally used to evaluate the model prediction "p", over the predictors "X".
            If you need to manipulate the predictions, pass a function to the "f_y" parameter
            '''
            X_ = X if f_X is None else f_X(X)
            p = m['model'].predict(X_)
            if f_y is not None:
                p = f_y(p)
            return p * m['weight']
        
        # Bagging
        bag_pred = _pred(self.bag, X)
        # Extra Trees
        etr_pred = _pred(self.etr, X)
        # Gradient Boosting: select a subset of the features
        gbr_pred = _pred(self.gbr, X, f_X=f_X_gbr)
        # Lasso
        las_pred = _pred(self.las, X)
        # MLR model: it works predicting the square root of the price
        mlr_pred = _pred(self.mlr, X, f_X=f_X_mlr, f_y=lambda y: y**2)
        # Ridge
        rid_pred = _pred(self.rid, X)
        
        # merge these predictions
        return np.sum(np.array([bag_pred, etr_pred, gbr_pred, las_pred, mlr_pred, rid_pred]), axis=0)
    
    
    def cross_val_predict(self, X, y, cv=3):
        '''
        Run predictions using the trained model (with cross-validation)
        '''
        # evaluate the predictions
        def _pred(m, X, y, f_X=None, f_y_priori=None, f_y_posteriori=None):
            '''
            Internally used to evaluate the model prediction "p", over the predictors "X".
            If you need to manipulate the predictions, pass a function to the "f_y" parameter
            '''
            # if not specified, "f"s are the identity functions
            X_ = X if f_X is None else f_X(X)
            y_ = y if f_y_priori is None else f_y_priori(y)
            # perform cross validation to make predictions
            res = sklearn.model_selection.cross_val_predict(m['instance'], X_, y_, cv=cv)
            if f_y_posteriori is not None:
                res = f_y_posteriori(res)
            # final result
            return res * m['weight']
        
        # Bagging
        bag_pred = _pred(self.bag, X, y)
        # Extra Trees
        etr_pred = _pred(self.etr, X, y)
        # Gradient Boosting: select a subset of the features
        gbr_pred = _pred(self.gbr, X, y, f_X=f_X_gbr)
        # Lasso
        las_pred = _pred(self.las, X, y)
        # MLR model: it works predicting the square root of the price
        mlr_pred = _pred(self.mlr, X, y, f_X=f_X_mlr, f_y_priori=np.sqrt, f_y_posteriori=lambda y: y**2)
        # Ridge
        rid_pred = _pred(self.rid, X, y)
        
        # merge these predictions
        return np.sum(np.array([bag_pred, etr_pred, gbr_pred, las_pred, mlr_pred, rid_pred]), axis=0)
    
    
    def validate(self, X, y, cv=3, verbose=True):
        '''
        Perform the cross-validation to validate the model and print obtained results
        '''
        # perform predictions
        y_pred = self.cross_val_predict(X, y, cv=cv)
        # compare predictions and test targets: evaluate the errors
        return model_validation_results(y, y_pred)
In [74]:
X_train, y_train = features_and_target(dataset_cleaned_2)
final_model = ApartmentPricePredictor()
In [112]:
_ = final_model.validate(X_train, y_train)
MSE: 436038772.33

Max absolute error: 154789.38
Min absolute error: 26.91
Avg absolute error: 13650.45

Max relative error: 59.62%
Min relative error: 0.02%
Avg relative error: 7.57%
Here the full resolution images

Our experiment produced a nice result: the final model performances are better than the ones of the Multiple Linear Regression one. We succeeded in reducing both the model's MSE and the average errors.


Final result

We can now perform the predictions on the given testset.

Test set manipulation

Before predicting, it is important to adapt the features to the ones we have used during the training. It means that:

  • we need to filter out those features that did not exist during the training phase
  • we need to introduce those features that existed (only) during the training phase
  • we have to generate the quadratic versione for the quantitative features
In [109]:
# the test set may have features which not exist in the training set
X_test = testset_complete.filter(items=dataset_complete_2.columns)

# generate quadratic features for the test set
for feature in X_test:
    # ignore the target
    if feature != 'SalePrice':
        # consider only quantitative features
        if feature in features_metadata and features_metadata[feature]['continuous']:
            X_test['{}^2'.format(feature)] = X_test[feature]**2
In [110]:
# some non-existing features of the test set may actually have been present during the training phase
missing_features = set(dataset_complete_2.columns) - set(X_test.columns) - set(['SalePrice'])
print('{} missing features:'.format(len(missing_features)))
print(missing_features)
40 missing features:
{'SaleType_ConLw', 'RoofMatl_Membran', 'RoofMatl_Metal', 'Condition2_PosA', 'Electrical_FuseP', 'Street_Grvl', 'PoolQC_Ex', 'Condition2_PosN', 'Utilities_NoSeWa', 'Exterior2nd_Other', 'BsmtCond_Po', 'Exterior2nd_BrkComm', 'Condition2_Artery', 'Exterior2nd_AsphShn', 'GarageType_2Types', 'Electrical_Mix', 'Exterior1st_AsphShn', 'ExterCond_Po', 'Condition2_RRNn', 'RoofMatl_WdShngl', 'SaleType_ConLD', 'Condition2_RRAn', 'Condition2_Feedr', 'GarageQual_Po', 'MiscFeature_Othr', 'Exterior1st_Stone', 'GarageCond_Ex', 'GarageQual_Ex', 'Exterior1st_BrkComm', 'Functional_Sev', 'Exterior2nd_ImStucc', 'Functional_Maj2', 'SaleType_Con', 'Exterior1st_ImStucc', 'SaleCondition_AdjLand', 'Neighborhood_Blueste', 'HeatingQC_Po', 'ExterCond_Ex', 'Condition1_RRNe', 'SaleType_ConLI'}

All of these features are actually dummy ones, generated starting from qualitative predictors.
It is safe to set them all to zero

In [111]:
# set dummy features to zero
for feature in missing_features:
    X_test[feature] = 0
In [132]:
# we need now to adapt the DataFrame schema
foo = pd.DataFrame()

for c in range(len(X_test.columns)):
    c_name = X_train.columns[c]
    foo[c_name] = X_test[c_name]
    
# check that data have been copied correctly
for c in X_test.columns:
    assert np.all(X_test[c] == foo[c])
# check the column are in the correct order
assert np.all(foo.columns == X_train.columns)

# we do not need "foo" anymore
X_test = foo
foo = None

Retrieving the predictions

We can use the entire dataset as training set: predictions will be evaluated over the test set

In [ ]:
# train the model
final_model.train(X_train, y_train)
In [146]:
# retrieve the predictions
final_result = np.round(final_model.predict(X_test))
print(final_result[:5])
[100602. 198901. 125018. 204465. 145312.]

Predictions as CSV file

As required, we have to save our predictions as a CSV file with header (Id, SalePrice)

In [159]:
pd.DataFrame({
    'Id': testset_Ids,
    'SalePrice': final_result
}).to_csv('final_predictions.csv', index=False)


Final considerations

Although a lot of things have been considered and analysed, several other alternative techniques could have been tested and explored. We had not enough time to try to perform the synergy effects analysis or to try other configurations for the alternative models we had developed.

For example, we could have considered what happens trying to keep only a subset of the features for models like the Extra Trees Regressor: or if some of the scikit-learn solutions can work well also in presence of outliers and high-leverage points.

Developing this notebook has been the opportunity for us to learn a lot of things. We are starting to feel much more confident when facing more complex machine learning problems. Thank you for the efforts you put in setting up the challenge for us.